What/Why/When/Where/How Framework and Faculty Development Workshop to Improve the Utility of Narrative Evaluations for Assessing Internal Medicine Residents

Introduction Clinical competency committees (CCCs) rely on narrative evaluations to assess resident competency. Despite the emphasis on these evaluations, their utility is frequently hindered by lack of sufficient detail for use by CCCs. Prior resources have sought to improve specificity of comments and use of evaluations by residents but not their utility for CCCs in assessing trainee performance. Methods We developed a 1-hour faculty development workshop focused on a newly devised framework for Department of Medicine faculty supervising internal medicine residents. The what/why/when/where/how framework highlighted key features of useful narrative evaluations: behaviors of strength and growth, contextualized observations, improvement over time, and actionable next steps. Workshop sessions were implemented at a large multisite internal medicine residency program. We assessed the workshop by measuring attendee confidence and skill in writing narrative evaluations useful for CCCs. Skill was assessed through a rubric adapted from literature on the utility of narrative evaluations. Results Fifty-four participants started the presurvey, and 33 completed the workshop, for a response rate of 61%. Participant confidence improved pre-, post-, and 3 months postworkshop. Total utility scores improved in mock evaluations from 12.4 to 15.5 and in real evaluations from 13.7 to 15.0, but only some subcomponent scores improved, with fewer improving in the real evaluations. Discussion A short workshop focusing on our framework improves confidence and utility of narrative evaluations of internal medicine residents for use by CCCs. Next steps should include developing more challenging components of narrative evaluations for continued improvement in trainee performance and faculty assessment.


Introduction
Evaluation of trainees has undergone extensive revision by the Accreditation Council for Graduate Medical Education (ACGME) with the adoption of six core competencies in 2001 and developmentally based, specialty-specific educational milestones in 2013. 1,2With these changes came the requirement that each training program establish a clinical competency committee (CCC), whose charge is to ensure trainees achieve competence for independent practice by the time of graduation as well as demonstrating ongoing development over the course of training. 1ost CCCs rely on multiple mediums to inform assessment of residents along this developmental trajectory, but the most valued source is evaluation comments from end-of rotation evaluations forms completed by faculty who directly supervise residents in the clinical setting. 3,4d-of-rotation evaluation forms vary in ability to capture resident performance.6][7][8] In fact, comments alone can accurately rank order the performance of learners. 9[8]10 The literature describes high-quality written evaluations as being specific, with contextualized examples. 6,10However, faculty face high cognitive load in converting observations to written comments as they try to recall examples or consider which might be useful in reflecting resident competency. 6,9,11Several studies have attempted to improve the quality of written evaluations.3][14][15][16][17] To our knowledge, no previous resources have been created with the goal of improving the utility of written comments for use by CCCs.
We implemented and evaluated a faculty development workshop aimed at improving the utility of narrative comments for use by the CCC at our internal medicine residency program.

Setting and Participants
We implemented and evaluated a 1-hour workshop for faculty with the goal of improving the utility of narrative comments for use by CCCs.We delivered the workshop at standing faculty development sessions in four divisions within the Department of Medicine at the University of Pittsburgh Medical Center in May-August 2021.All faculty who regularly supervised internal medicine residents were invited via email to attend.Participation in the workshop was entirely voluntary.We created the workshop to be delivered both in person and virtually.A 1-hour interactive adaptable workshop model was chosen for feasibility in reaching a large faculty body located at multiple sites in multiple divisions of varying clinical structures.

Workshop
The workshop was an interactive didactic (Appendix A) that focused on the use of a what/why/when/where/how framework (Appendix B) that not only consolidated the core features that CCC members used in evaluating residents but also decreased cognitive load of observing faculty.
Workshop content focused on (1) the value of narrative comments, (2) actionable feedback as described in Kim Scott's book Radical Candor, 18 and (3) key features of useful narrative comments using a what/why/when/where/how framework.The workshop demonstrated the framework through step-by-step analysis of examples of useful and not useful narrative comments.The framework was informed by a local formal qualitative needs assessment of our CCC, which highlighted the usefulness of comments on behavioral content, performance relative to peers, and next steps for improvement in assessing residents.These concepts were also highlighted in the literature that we incorporated in developing the what/why/when/where/how framework.The "What behaviors did you observe?" and "Why are these behaviors reflective of their performance?"components emphasized the importance of including specific behavioral examples and reflecting on why these behaviors were salient representations of performance. 19he "When in training?"and "Where were the behaviors observed?"components emphasized the contexts of the observation, allowing observing faculty to provide comparisons to a mental scale. 20The "How can the trainee improve?" component emphasized the importance of providing next steps for improvement in guiding the resident for continued improvement. 20rticipants were asked to apply the framework to mock learner videos.We created two 5-minute mock learner videos (Appendices C and D) that captured typical skills of senior internal medicine residents during an inpatient general medicine service.Participants wrote mock evaluations (MEs) immediately prior to the workshop, which demonstrated priming, and immediately after the workshop using our framework, which demonstrated deliberate practice of the new skill.

Outcomes Measured
We evaluated effects of the workshop on (1) participants' confidence in writing narrative comments, (2) utility of participants' narrative comments on mock resident cases, and (3) utility of participants' narrative comments on real end-of-rotation resident evaluations in MedHub.

Confidence survey:
We assessed confidence through electronic Qualtrics XM surveys (Appendix E) prior to, after, and 3 months after the workshop.Survey question responses were based on a 5-point Likert scale (1 = not at all confident, 5 = very confident).Each item was developed and iteratively refined with input from CCC members and observing faculty.

Utility of comments on mock resident cases:
We assessed skill development by evaluating utility of comments on mock resident cases.Participants submitted a narrative ME of the learner case prior to and after the workshop.We deidentified and randomized the pre-and postworkshop MEs.Two medicinepediatric CCC members graded each ME for its utility in providing useful information for the CCC using the utility grading rubric (Appendix F).We chose CCC members outside of the internal medicine residency program to allow for generalizability of the content, and we specifically chose medicine-pediatric CCC members for similarity of resident milestones of this discipline to internal medicine.We used the average of the two graders' item scores and total scores in the final analysis.

Utility of comments on real resident evaluations:
We assessed practice change by evaluating the utility of narrative comments written by participants on end-of-rotation MedHub real evaluation (RE) forms routinely collected by the residency program pre-and postworkshop.MedHub evaluations with no written comments were excluded.We performed deidentification, randomization, and scoring as described above for MEs.Participants were not informed of the time frame in which postworkshop narrative evaluations would be assessed in an attempt to decrease the likelihood of a Hawthorne effect.

Methods of Analysis
Development of utility grading rubric: To assess utility of narrative evaluations for use by the CCC, we created a grading rubric (Appendix F) modeled after a previously published grading rubric for assessing the overall quality of narrative evaluations. 21Our grading rubric included an overall score as well as components of the framework: behaviors of strength, behaviors of growth, context, improvement over time, and actionable next steps.The items and scale of the adapted graded rubric were discussed and iteratively optimized with a focus group of four members of the internal medicine CCC, resulting in six items rated on a 5-point Likert scale.Then, two medicine-pediatric residency CCC members scored six randomly selected sample narrative evaluations and discussed for consensus.This discussion resulted in the final grading rubric with the same six items but now rated on a 4-point Likert scale (1 = not useful, 4 = extremely useful), for a total score of 24.We assessed the rubric for interrater reliability using interclass correlation coefficients based on all analyzed MEs.

Statistical analysis:
We assessed changes in confidence pre-, post-, and 3 months postworkshop for each confidence item using a mixed-effects linear regression model fit with time as the categorical predictor variable.Bonferroni adjustments were employed when analyzing pairwise differences between time points.The evaluation utility overall score and subdomains on both the MEs and REs were also analyzed using mixedeffects linear regression models.We chose this modeling approach to account for repeated measures per participant and make maximum use of all available data.We used Stata (StataCorp) for statistical analysis and Excel for data management, deidentification, and randomization where needed, and a significance level of 5% was assumed.

Facilitator Guide
We include here a guide for facilitation of the workshop (Appendix G).
This project was designated exempt by the University of Pittsburgh Institutional Review Board.

Results
Participants Fifty-four participants started the presurvey, and 33 completed the workshop, for a response rate of 61%.The participants represented varying time on faculty, experience on the CCC, and several subspecialities (Table 1).

Utility Grading Rubric
The interclass correlation for interrater reliability for the total score on the utility grading rubric (Appendix F) was .83.

Confidence Survey
The 33 faculty who participated in the workshop and completed both pre-and postsurveys were included in the confidence analysis.All confidence items significantly improved pre-to postworkshop and sustained at 3 months: overall confidence (2.7 to 3.6 to 3.5), confidence in writing narrative evaluations  2).There was no significant difference between immediate postworkshop and 3 months postworkshop for all comparisons.
Utility of comments on mock resident cases: Thirty-seven MEs were submitted preworkshop, and 33 MEs were submitted postworkshop.The total utility score of learner case narrative evaluations significantly improved postworkshop from 12.4 to 15.5, with p < .001.Scores for specific behaviors of strength, specific behaviors of growth, context, and overall scores significantly improved.Scores for improvement over time and actionable next steps did not significantly improve (Table 3).
Utility of comments on real resident evaluations: Of the 33 participants who completed the workshop, 17 participants had REs both pre-and postworkshop that met inclusion criteria.From these 17 participants, we collected 128 REs preworkshop and 95 REs postworkshop.We randomly selected up to five evaluations for each participant for each pre-and postworkshop for the final analysis.Three postworkshop REs did not include a narrative comment and were excluded.The final analysis included a total of 73 preworkshop and 58 postworkshop REs.
The mean total utility score was 13.7 preworkshop and 15.0 postworkshop, with p < .05.Context significantly improved from 2.0 preworkshop to 2.6 postworkshop, with p < .001,but there was no significant improvement in the other domains.The overall utility score was 2.6 preworkshop and 2.8 postworkshop, with p = .07(Table 4).

Discussion
This 1-hour faculty development workshop to improve the utility of narrative evaluations in assessing internal medicine residents was designed for feasibility in impactfully reaching a broad audience through a short session focused on decreasing cognitive load.This framework and deliberate practice led to improved confidence, skill, and behavioral practice of faculty in writing narrative comments on internal medicine residents that would be useful for CCCs.
The workshop's design focused on capturing behavioral details of resident performance in narrative comments through a framework.The framework built upon prior resources that also focused on behavioral examples.Prior resources used a process flow or a reference card for competencies but not specific descriptors of behaviors useful for CCCs. 7,14e highlighted the key features of behaviors CCCs use in assessing residents that had not been captured in previous resources, namely, the components of when, where, and how/next steps. 7,13,14Additionally, the framework addressed concerns of faculty time and cognitive load by providing an anchor for these details and a reference tool to use in practice.Another study provided direct in-person feedback to faculty on their narrative comments, which can be time consuming to implement. 12This workshop relied on an easily distributable framework to guide narrative writing.Furthermore, the workshop's evaluation addressed an important question about the utility of narrative comments for the purpose of resident evaluation.Prior studies evaluated syntax for specificity or resident reception, but it was unclear if CCCs found them more useful. 13,14In the assessment of this workshop, we used a scale of utility and demonstrated local content, face validity, and interrater reliability.This scale can be used by other programs  attempting to assess local narrative comments for utility for their CCCs.
Not all subcomponents of the narrative comments improved in the MEs and REs.In the MEs, improvement over time and next steps did not improve.These components require more depth of reflection on resident performance and may also require additional faculty training.Additionally, the level of detail in the mock videos may not have been sufficient to accurately capture this skill.In the REs, only the context subcomponent improved significantly.Context of time in training and patient acuity may be simpler concepts to report.The assessment of REs shows application in both a different and a delayed setting, which may have led to loss in some detail of the skill.
An important limitation of this project is that we did not specifically assess the role of bias in narrative evaluations.
Several studies have shown that the use of language changes based on reported gender and race, 22,23 and this would be an interesting area for future study.Other limitations include delivery and assessment at a single large academic institution in a specific residency specialty, so results may not be generalizable across other institutions or specialties.Participation was voluntary across many divisions, but participation outside of general internal medicine and hospital medicine was limited.Additionally, voluntary participation may have introduced self-selection bias as participants could have been more likely to use workshop concepts in future narrative evaluations than others would.Confidence surveys were based on Likert scales, which might have introduced response bias.We are unable to comment on the effect of this workshop compared to a broader baseline including faculty who did not participate.In addition, writing narrative evaluations using the what/why/when/where/how framework may take more time and effort than traditional narrative evaluations.Finally, while the scoring tool was adapted from a previously published rubric, was reviewed and revised by local content experts, and was assessed for interrater reliability, its use may require further reliability and validity testing before application in other settings.
Next steps for continued improvement through use of the workshop's content and framework may require repetition of practice.Addressing this gap could be as simple as providing the framework as a reference at the beginning of each rotation, as a reference in the evaluation form itself, or both.Another important next step is to assess whether this framework's focus on objective comments helps decrease gender or racial bias in evaluations.If programs are able to incorporate the framework into learning management software, building in components that flag biased language may mitigate overall bias in narrative evaluations.Such an implementation would require careful assessment and validation.Additionally, this workshop relies on individual reflection on content and skills.The focus on individual reflection can be improved by allowing participants to apply the utility rubric to their evaluations as a form of ongoing formative self-feedback.Future iterations could be further improved by including small-group discussions of mock learner videos, allowing participants to learn from each other's observations and reflections, or employing a mock CCC approach.Future implementations could also provide regular application of the utility rubric to REs for ongoing formative feedback to faculty as well as assessment of the workshop's long-term impact.Finally, future implementations should assess whether subsequent interactive sessions are needed or if a reference to the framework will provide a similar effect but allow for maintained feasibility for busy academic faculty.
Overall, this short workshop provides a high-yield framework resulting in improved overall utility of real narrative comments for use by CCCs.The workshop fulfills the needs of two important stakeholders in narrative comments-CCCs and observing faculty-by addressing inconsistent views and providing a simple high-yield framework.While the workshop was implemented at a single institution, its materials are translatable to training programs with CCCs that use a developmental approach.

Table 1 .
Characteristics of Participants (N = 33) Copyright © 2024 Sekar et al.This is an open-access publication distributed under the terms of the Creative Commons Attribution license.

Table 2 .
Confidence in Writing Narrative Evaluations a Rated on a 5-point Likert scale (1 = not at all confident, 5 = very confident).Copyright © 2024 Sekar et al.This is an open-access publication distributed under the terms of the Creative Commons Attribution license.

Table 3 .
Utility of Mock Evaluations

Table 4 .
Utility of Real Evaluations